
ENH: Supporting REPEATED schema for list types #60


Closed
txomon wants to merge 3 commits

Conversation

@txomon commented Jun 22, 2017

Basically, at the moment, if there are lists inside a column we have no way to put that data into GBQ. This enables the REPEATED mode in a simple way.
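For context, a minimal sketch of the kind of frame this targets and the shape a REPEATED field takes in a BigQuery schema (the column names and the schema dict below are illustrative, not taken from the diff):

```python
import pandas as pd

# A column whose cells are Python lists; the current code path has no way
# to describe this in the generated BigQuery schema.
df = pd.DataFrame({
    "user_id": [1, 2],
    "tags": [["a", "b"], ["c"]],
})

# Rough shape of a schema containing a REPEATED field, in the BigQuery REST
# representation (hand-written here for illustration):
schema = {
    "fields": [
        {"name": "user_id", "type": "INTEGER", "mode": "NULLABLE"},
        {"name": "tags", "type": "STRING", "mode": "REPEATED"},
    ]
}
```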

@codecov-io commented Jun 22, 2017

Codecov Report

Merging #60 into master will decrease coverage by 45.14%.
The diff coverage is 44.44%.


@@             Coverage Diff             @@
##           master      #60       +/-   ##
===========================================
- Coverage   73.44%   28.29%   -45.15%     
===========================================
  Files           4        4               
  Lines        1540     1548        +8     
===========================================
- Hits         1131      438      -693     
- Misses        409     1110      +701
Impacted Files                 Coverage Δ
pandas_gbq/tests/test_gbq.py   27.89% <20%> (-54.99%) ⬇️
pandas_gbq/gbq.py              19.62% <75%> (-55.71%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 852b2a3...30c4248.

@jreback (Contributor) commented Jun 22, 2017

tests!

@txomon (Author) commented Jun 30, 2017

I don't get what is failing; the test is there, but it doesn't look like it's being executed... 🤔

@max-sixty (Contributor) commented Jun 30, 2017

FWIW it's not 'pandantic' to put lists in a dataframe, and IIRC if they're on an index you'll face issues with a lot of pandas functionality

We approach this with either:

  1. separate dataframes uploaded to BQ, and then run a BQ query to merge them into a nested format in a single table (a sketch of this follows the list)
  2. stack / unstack, with a BQ query to merge columns
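A minimal sketch of approach 1, assuming the parent and child rows live in two flat frames and a staging dataset already exists; the project id, table names, and column names are placeholders:

```python
import pandas as pd
import pandas_gbq

parents = pd.DataFrame({"user_id": [1, 2], "name": ["ana", "bo"]})
children = pd.DataFrame({"user_id": [1, 1, 2], "tag": ["a", "b", "c"]})

# Upload the two flat frames to staging tables.
pandas_gbq.to_gbq(parents, "staging.users", project_id="my-project")
pandas_gbq.to_gbq(children, "staging.user_tags", project_id="my-project")

# Then merge them into a nested (REPEATED) column with a BigQuery query,
# e.g. by writing the result of this standard-SQL statement to the final table:
nesting_query = """
SELECT u.user_id, u.name, ARRAY_AGG(t.tag) AS tags
FROM `my-project.staging.users` AS u
JOIN `my-project.staging.user_tags` AS t USING (user_id)
GROUP BY u.user_id, u.name
"""
```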

@txomon (Author) commented Jul 3, 2017

@MaximilianR I guessed it is not, but it is something we use; thankfully not too often, but we still need it.

  1. Should we emit a warning when a list schema is generated (something like the sketch below)?
  2. Can you tell me why the test isn't being executed?
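On point 1, a hedged sketch of what such a warning might look like (not code from the PR, just an illustration with a made-up column name):

```python
import warnings

# Warn the caller that a list-valued column will be mapped to a REPEATED field.
warnings.warn(
    "Column 'tags' contains list values and will be uploaded as a REPEATED field.",
    UserWarning,
)
```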

@max-sixty (Contributor) commented

What makes you think it's not being executed?

@parthea requested a review from tswast July 11, 2017 03:45
@parthea (Contributor) commented Jul 11, 2017

Thanks @txomon! I think #25 may help us provide support for BigQuery repeated fields. I'd like to keep this PR open just in case #25 doesn't provide this functionality.

Regarding the integration tests being skipped, some integration tests will be skipped if a BigQuery project id is not set. I think the tests should fail if a project id is not set. I've created #72 to track this improvement.

Follow these steps to run the BigQuery integration tests on Travis:
https://pandas-gbq.readthedocs.io/en/latest/contributing.html#running-google-bigquery-integration-tests

@tswast (Collaborator) commented Jul 11, 2017

Yeah, #25 may help. This is also something to make sure we handle when we switch from streaming writes to bulk upload #7.

@max-sixty (Contributor) commented

Anyone know whether this is still an issue?

@tswast (Collaborator) commented Apr 3, 2018

A similar issue was fixed for read_gbq in #134.

I think this PR would probably be useful if we wanted to support list types in to_gbq, but the fix would be more involved. (Would need to switch to a different file type than CSV for load jobs, maybe Parquet?)
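For reference, a rough sketch of what the Parquet route could look like, assuming pyarrow and the google-cloud-bigquery client are used directly rather than pandas-gbq's current CSV path; the project, dataset, table, and file names are placeholders:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery

df = pd.DataFrame({"user_id": [1, 2], "tags": [["a", "b"], ["c"]]})

# Parquet can represent the list column natively, unlike CSV.
pq.write_table(pa.Table.from_pandas(df), "/tmp/rows.parquet")

# Load the Parquet file; BigQuery maps the list column to a REPEATED field.
client = bigquery.Client(project="my-project")
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
with open("/tmp/rows.parquet", "rb") as f:
    job = client.load_table_from_file(f, "my_dataset.my_table", job_config=job_config)
job.result()
```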

@max-sixty (Contributor) commented

Right, and I think pandas is then not the right tool for writing nested items - lists are poorly supported, if at all (I'm doubtful that it can write nested Parquet files, for example, so we'd have to write that part too).

We've thought about something that would go from xarray (i.e. 3+ dimensions) to a nested format to BQ, but even then there are difficulties (which dimension becomes nested? How can you create the JSON / dict in a memory-efficient way in python?)

I would vote to close, but I'm very open if there are any creative ideas.

@max-sixty (Contributor) commented

Closing as stale, but open to a solution for writing repeated fields

@max-sixty closed this Aug 22, 2018